1. (Original) A method for determining a degree of similarity between documents, the method
comprising the steps of: 

storing, for at least two documents, labeled tree representations of respective 
documents; 

storing, for at least two documents, path representations relating to paths that occur in 
the documents from root nodes to leaf nodes in the labeled tree representations of the respective 
documents; and 

calculating a measure of similarity between two of the documents based upon the 
frequency of occurrence of similar paths specified by the path representations. 

2. (Original) The method as claimed in claim 1, wherein the tree representation is a Document 
Model Object representation. 

3. (Original) The method as claimed in claim 1, further comprising the step of generating a path 
representation for a path of a document as a sequence of labels representative from a root node to 
a leaf node in the labeled tree representation of the document. 

4. (Original) The method as claimed in claim 1, further comprising the step of storing, as path
representations, sets of sequenced labels representative of distinct paths in a labeled tree 
representation of a corresponding document. 
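
The path representations of claims 3 and 4 can be illustrated with a minimal sketch. It assumes the labeled tree is parsed with Python's standard `xml.etree.ElementTree` and that element tags serve as node labels; the function name `root_to_leaf_paths` and the sample markup are hypothetical and do not come from the claims.

```python
import xml.etree.ElementTree as ET

def root_to_leaf_paths(element, prefix=()):
    """Yield one tuple of labels for each root-to-leaf path under `element`."""
    labels = prefix + (element.tag,)
    children = list(element)
    if not children:
        # A leaf node: the accumulated label sequence is a complete path.
        yield labels
        return
    for child in children:
        yield from root_to_leaf_paths(child, labels)

# Hypothetical sample document, used only for illustration.
doc = ET.fromstring("<html><body><p/><p/><div><p/></div></body></html>")
paths = list(root_to_leaf_paths(doc))   # every path occurrence, duplicates kept
distinct_paths = set(paths)             # the distinct paths of claim 4
```

Keeping `paths` as a list preserves how often each path occurs, while `distinct_paths` holds only the set of sequenced labels.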

5. (Original) The method as claimed in claim 4, further comprising the step of storing a path 
dictionary ($Dict_{paths} = \{p_1, p_2, \ldots, p_n\}$) of distinct paths collated from a tree representation for a
document. 

6. (Original) The method as claimed in claim 5, further comprising the step of eliminating 
selected paths from the path dictionary ($Dict_{paths}$).

7. (Original) The method as claimed in claim 6, wherein paths that occur highly frequently or 
highly infrequently are eliminated from the path dictionary ($Dict_{paths}$).

8. (Original) The method as claimed in claim 7, further comprising the step of computing the
frequency of occurrence ($f_j(p_i)$) of a path ($p_i$) in a document ($d_j$).
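
Claims 5 to 8 can be sketched as follows. The claims do not state how "highly frequently" or "highly infrequently" is decided, so this sketch assumes a document-frequency test over a small collection; the thresholds `min_df` and `max_df_ratio`, the function names, and the slash-joined path strings are all hypothetical.

```python
from collections import Counter

def build_dictionary(docs_paths, min_df=2, max_df_ratio=1.0):
    """Collate distinct paths, then drop very rare and very common ones.

    min_df and max_df_ratio are assumed thresholds, not taken from the claims.
    """
    doc_freq = Counter()
    for paths in docs_paths:
        doc_freq.update(set(paths))      # count each path once per document
    n_docs = len(docs_paths)
    return sorted(p for p, df in doc_freq.items()
                  if df >= min_df and df / n_docs <= max_df_ratio)

def path_frequencies(paths, dictionary):
    """Return f_j(p_i) for every dictionary path p_i in one document d_j."""
    counts = Counter(paths)
    return [counts[p] for p in dictionary]

# Hypothetical collection of three documents, each a list of path occurrences.
docs_paths = [
    ["html/body/p", "html/body/p", "html/body/div/p"],
    ["html/body/p", "html/body/table/tr/td"],
    ["html/body/p", "html/body/div/p"],
]
dictionary = build_dictionary(docs_paths)
pruned = build_dictionary(docs_paths, max_df_ratio=0.9)
freqs_doc0 = path_frequencies(docs_paths[0], dictionary)
```

Here the path seen in only one document is dropped by `min_df`, and lowering `max_df_ratio` also drops the path present in every document.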

9. (Original) The method as claimed in claim 8, further comprising the step of computing the 
maximum number of instances ($f_{max} = \max_i f_j(p_i)$) in which a path ($p_i$) in the document ($d_j$)
occurs. 

10. (Original) The method as claimed in claim 9, further comprising the step of storing a 
representation of the document ($d_j$) as an $N$-dimensional vector ($[d_{j1}, d_{j2}, \ldots, d_{jN}]$, where $d_{jk} =
f_j(p_k)/f_{max}$, $1 \le k \le N$) of relative frequencies of occurrence ($f_j(p_i)$) of paths ($p_i$) in the document
($d_j$).
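
The vector of claim 10 divides each path frequency by the largest frequency in the same document. A minimal sketch, assuming the frequencies are already listed in dictionary order; the function name and the all-zero fallback for a document with no dictionary paths are assumptions, not part of the claims.

```python
def relative_frequency_vector(freqs):
    """Map raw counts f_j(p_k) to d_jk = f_j(p_k) / f_max."""
    f_max = max(freqs, default=0)
    if f_max == 0:
        # Assumed fallback: no dictionary path occurs in the document.
        return [0.0] * len(freqs)
    return [f / f_max for f in freqs]

# Hypothetical counts for a four-path dictionary.
vector = relative_frequency_vector([2, 1, 0, 4])
```

Every component then lies between 0 and 1, with the most frequent path mapped to 1.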

11. (Original) The method as claimed in claim 8, further comprising the step of computing the
minimum number of instances ($f_{min} = \min_i f_j(p_i)$) in which a path ($p_i$) in the document ($d_j$) occurs.

12. (Original) The method as claimed in claim 10, further comprising the step of computing the 
similarity between a pair of documents ($d_i$, $d_j$) as a function ($sim(d_i, d_j)$) of metrics relating the
number of paths common to the respective documents ($d_i$, $d_j$).

13. (Original) The method as claimed in claim 12, wherein the function for computing the
similarity between a pair of documents ($d_i$, $d_j$)

$$sim(d_i, d_j) = \frac{\sum_{k=1}^{N} \min(d_{ik}, d_{jk})}{\sum_{k=1}^{N} \max(d_{ik}, d_{jk})}$$

is the quotient of a numerator, defined as the sum for all paths ($k = 1 \ldots N$) of the minimum
number of instances ($\min(d_{ik}, d_{jk})$) in which paths occur in the respective documents ($d_i$, $d_j$), and
a denominator, defined as the sum for all paths ($k = 1 \ldots N$) of the maximum number of instances
($\max(d_{ik}, d_{jk})$) in which paths occur in the respective documents ($d_i$, $d_j$).
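
The quotient of claim 13 can be computed directly from two relative-frequency vectors. This is a minimal sketch: the function name `similarity` and the choice to return 0.0 when both vectors are all zero are assumptions, since the claim does not address that case.

```python
def similarity(d_i, d_j):
    """Sum of component-wise minima divided by sum of component-wise maxima."""
    if len(d_i) != len(d_j):
        raise ValueError("vectors must share the same path dictionary")
    numerator = sum(min(a, b) for a, b in zip(d_i, d_j))
    denominator = sum(max(a, b) for a, b in zip(d_i, d_j))
    # Assumed convention: two empty documents score 0.0 rather than raising.
    return numerator / denominator if denominator else 0.0

# Hypothetical vectors over a three-path dictionary.
score = similarity([1.0, 0.5, 0.0], [0.5, 0.5, 1.0])   # 1.0 / 2.5
same = similarity([1.0, 0.5], [1.0, 0.5])
```

Identical vectors score 1, and vectors with no path in common score 0.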

14. (Original) The method as claimed in claim 1, wherein the tree representation of a document
includes a positional index, which represents, for a node (n), the number of previous sibling
nodes with the same label as that of node (n).
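
The positional index of claim 14 can be sketched as a count, kept per label, of the siblings already seen under a parent. The sketch assumes the tree is parsed with `xml.etree.ElementTree`; the function name and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

def positional_indices(parent):
    """Pair each child's label with its number of earlier same-label siblings."""
    seen = {}
    result = []
    for child in parent:
        index = seen.get(child.tag, 0)
        result.append((child.tag, index))
        seen[child.tag] = index + 1
    return result

# Hypothetical parent node with four children.
body = ET.fromstring("<body><p/><div/><p/><p/></body>")
indexed = positional_indices(body)
```

The first node carrying a given label receives index 0, so the index is zero-based under this reading of the claim.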

15. (Original) The method as claimed in claim 14, further comprising the step of storing as a path 
representation a set that defines positional information of sibling nodes under a parent node. 

16. (Original) The method as claimed in claim 15, further comprising the step of storing precise 
path representations that precisely define a document structure, and generalised path 
representations that partially generalise structural aspects of precise path representations of a 
document. 



17. (Original) The method as claimed in claim 16, wherein the step of calculating the measure of 
similarity involves determining a total number of precise path representations of one document 
that are either shared by the other document, or are a subsumed subset of at least one of the 
generalised path representations of the other document. 

18. (Original) The method as claimed in claim 17, further comprising the step of normalising the 
measure of similarity by a term that represents the number of unique path representations shared 
by the two documents. 

19. (Original) The method as claimed in claim 18, wherein the number of unique path 
representations is calculated by adding the number of path representations for each document, 
and subtracting from this total the number of path representations shared by the two documents.
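
One possible reading of claims 16 to 19 is sketched below. It assumes each path is a tuple of (label, positional index) terms and that a generalised path uses `None` as an index meaning "any position"; this encoding, the function names, and the zero fallback are hypothetical, and the claims may admit other interpretations.

```python
def subsumes(general, precise):
    """True if `general` matches `precise`; a None index matches any position."""
    if len(general) != len(precise):
        return False
    return all(g_label == p_label and (g_index is None or g_index == p_index)
               for (g_label, g_index), (p_label, p_index) in zip(general, precise))

def structural_similarity(precise_a, precise_b, general_b):
    """Count paths of A shared with B or subsumed by B, then normalise."""
    matched = sum(1 for p in precise_a
                  if p in precise_b or any(subsumes(g, p) for g in general_b))
    shared = precise_a & precise_b
    # Claim 19: add both path counts, then subtract the shared count.
    unique = len(precise_a) + len(precise_b) - len(shared)
    return matched / unique if unique else 0.0

# Hypothetical documents: A has two paragraphs, B has one.
a = {(("html", 0), ("body", 0), ("p", 0)), (("html", 0), ("body", 0), ("p", 1))}
b = {(("html", 0), ("body", 0), ("p", 0))}
general_b = {(("html", 0), ("body", 0), ("p", None))}
with_general = structural_similarity(a, b, general_b)
without_general = structural_similarity(a, b, set())
```

The measure is directional under this reading, because it counts the precise paths of one document against the precise and generalised paths of the other.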

20. (Original) The method as claimed in claim 14, further comprising the step of storing as a 
path representation a sequence of terms separated by a delimiting symbol, in which each term is 
represented by a label and a parenthesised predicate that specifies the positional index of the 
term either specifically or generally. 

Claim 21. (Cancelled).
Claim 22. (Cancelled). 

Please add the following new claims: 

23. (New) A program storage device readable by computer, tangibly embodying a program of 
instructions executable by said computer to perform a method for determining a degree of 
similarity between documents, the method comprising: 

storing, for at least two documents, labeled tree representations of respective 
documents; 

storing, for at least two documents, path representations relating to paths that occur in 
the documents from root nodes to leaf nodes in the labeled tree representations of the respective 
documents; and 

calculating a measure of similarity between two of the documents based upon the 
frequency of occurrence of similar paths specified by the path representations. 

24. (New) The program storage device in claim 23, wherein said method further comprises the 
tree representation is a Document Model Object representation. 

25. (New) The program storage device in claim 23, wherein said method further comprises the 
step of generating a path representation for a path of a document as a sequence of labels 
representative from a root node to a leaf node in the labeled tree representation of the document. 

26. (New) The program storage device in claim 23, wherein said method further comprises the 
step of storing, as path representations, sets of sequenced labels representative of distinct paths in 
a labeled tree representation of a corresponding document. 

28. (New) The program storage device in claim 23, wherein the tree representation of a 
document includes a positional index, which represents, for a node (n), the number of previous
sibling nodes with the same label as that of node (n). 

29. (New) A computer system operable for determining a degree of similarity between
documents, the computer system comprising: 

a first storage unit operable for storing labeled tree representations of respective 
documents for at least two documents; 

a second storage unit operable for storing, for at least two documents, path 
representations relating to paths that occur in the documents from root nodes to leaf nodes in the 
labeled tree representations of the respective documents; and 

a calculator operable for calculating a measure of similarity between two of the 
documents based upon the frequency of occurrence of similar paths specified by the path 
representations.

30. (New) The computer system device in claim 29, wherein said method further comprises the 
tree representation is a Document Model Object representation. 

31. (New) The computer system device in claim 29, wherein said method further comprises the 
step of generating a path representation for a path of a document as a sequence of labels 
representative from a root node to a leaf node in the labeled tree representation of the document. 

32. (New) The computer system device in claim 29, wherein said method further comprises the 
step of storing, as path representations, sets of sequenced labels representative of distinct paths in 
a labeled tree representation of a corresponding document. 

33. (New) The computer system device in claim 29, wherein the tree representation of a 
document includes a positional index, which represents, for a node (n), the number of previous
sibling nodes with the same label as that of node (n).